Exploring Bogota venues across all neighborhoods using the Foursquare API - IBM Data Science Capstone project

This project explores the neighborhoods of Bogota based on how similar their venues are. Similarity is defined by how many venues of the same type two neighborhoods have, measured as the Euclidean distance between the frequencies of every type of venue in each neighborhood.
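As a rough illustration of that similarity measure, the sketch below compares neighborhoods described by vectors of venue-category frequencies. The three categories and all the values are hypothetical; only the distance formula itself is standard.

```python
import math

# Hypothetical venue-category frequencies for three neighborhoods,
# in the order: restaurant, park, gym (illustrative values only).
freq_a = [0.50, 0.25, 0.25]
freq_b = [0.50, 0.25, 0.25]
freq_c = [0.00, 0.00, 1.00]


def euclidean(u, v):
    """Euclidean distance between two equal-length frequency vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


dist_ab = euclidean(freq_a, freq_b)  # identical venue profiles
dist_ac = euclidean(freq_a, freq_c)  # very different venue profiles
print(dist_ab, dist_ac)
```

The smaller the distance, the more alike two neighborhoods are in terms of the venues they offer.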

Data acquisition

The data used in this project was obtained from a local government site that offers an open database of local features of interest. The available data includes every neighborhood with its location and population, as well as the location of sites of interest such as schools and hospitals, among other types of data. All of this data can be downloaded as CSV files.

The structure of the data that will be used in this project is the following.

OBJECTID Locality Code Locality Legal Status Neighborhood Code Latitude Longitude

Every row of the data file contains information about the locality (borough), the locality code, the neighborhood, and its geospatial location, among other data.

Once the data is available, we begin by reading it into a data frame using pandas.
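A minimal sketch of that loading step is shown below. It reads from an in-memory string so that it runs on its own; the real notebook would pass the path of the downloaded CSV instead. The column names are one plausible reading of the header listed above, while the rows, the ';' separator, and the status labels are invented for illustration.

```python
import io

import pandas as pd

# Stand-in for the downloaded file. Rows, separator and status labels are
# assumptions made for this example, not values taken from the real data set.
sample_csv = io.StringIO(
    "OBJECTID;Locality Code;Locality;Legal Status;Neighborhood;Latitude;Longitude\n"
    "1;01;USAQUEN;LEGALIZADO;BARRIO A;4,70;-74,03\n"
    "2;05;USME;SIN LEGALIZAR;BARRIO B;4,47;-74,12\n"
)

# dtype=str keeps codes such as "01" intact and leaves decimal commas untouched,
# so the cleaning can be done explicitly in a later step.
neighborhoods = pd.read_csv(sample_csv, sep=";", dtype=str)
print(neighborhoods.head())
```

Reading everything as text first is a cautious choice: it avoids silently dropping leading zeros in the codes.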

The first thing to notice is the legal status feature. Some neighborhoods in Bogota have not yet been legalized by the local authorities, while others have. The non-legalized neighborhoods are defined as areas of invasion, where people start a community, most of the time in abandoned areas.

Legalized Neighborhoods and data cleaning

Let's see how the neighborhoods are distributed in terms of their legal status.

Almost 2032 small neighborhoods in Bogota are not legalized, versus 1801 legalized neighborhoods. That is a strikingly high number by comparison.

The first thing to do is to separate the legalized neighborhoods from those that are not legalized.
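The count and the split described above could be sketched as follows. The data frame is a toy example, and the status labels are placeholders rather than the values used in the real file.

```python
import pandas as pd

# Toy data; "LEGALIZED" / "NOT LEGALIZED" are placeholder labels.
neighborhoods = pd.DataFrame({
    "Neighborhood": ["A", "B", "C", "D", "E"],
    "Legal Status": [
        "LEGALIZED", "NOT LEGALIZED", "NOT LEGALIZED", "LEGALIZED", "NOT LEGALIZED",
    ],
})

# How many neighborhoods fall under each legal status.
status_counts = neighborhoods["Legal Status"].value_counts()
print(status_counts)

# Boolean mask, then two independent frames with a fresh index each.
is_legalized = neighborhoods["Legal Status"] == "LEGALIZED"
legalized = neighborhoods[is_legalized].reset_index(drop=True)
non_legalized = neighborhoods[~is_legalized].reset_index(drop=True)
print(len(legalized), len(non_legalized))
```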

This is a familiar data frame, and it is almost ready to use. Let's check the data types and replace the decimal ',' with a '.' to follow the international convention.
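A short sketch of that cleaning step, assuming the coordinates were read as text with decimal commas (the values below are illustrative):

```python
import pandas as pd

# Toy coordinates stored as text with decimal commas.
legalized = pd.DataFrame({
    "Neighborhood": ["A", "B"],
    "Latitude": ["4,70", "4,47"],
    "Longitude": ["-74,03", "-74,12"],
})
print(legalized.dtypes)  # the coordinates are text columns at this point

# Swap the decimal comma for a point, then convert to float.
for column in ["Latitude", "Longitude"]:
    legalized[column] = (
        legalized[column].str.replace(",", ".", regex=False).astype(float)
    )
print(legalized.dtypes)  # the coordinates are now numeric
```

An alternative is to pass `decimal=","` to `pd.read_csv` so the conversion happens while the file is read.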

Legalized Neighborhoods - Geographical distribution

Now let's see how the legalized neighborhoods are distributed across the city.

The first thing we can notice is the agglomeration of neighborhoods in the peripheral areas of the city. This is a common pattern in Bogota, as most of the residential neighborhoods are located in these zones, while the working and office areas are located closer to the center of the city.

Non-legalized Neighborhoods

Let's repeat the same process, now with the non-legalized neighborhoods.

Now let's check the geographical distribution of legalized vs non-legalized neighborhoods.

Some conclusions can be drawn immediately from the map above. For example, not all non-legalized neighborhoods are located in the peripheral areas of the city; some are located inside the city, near legalized neighborhoods.

Some of these so-called non-legalized neighborhoods consist of new constructions or new groups of buildings in previously unused, empty areas. For the current project only the legalized neighborhoods will be considered, leaving the analysis of the non-legalized neighborhoods for future work.

It is now an obligation of both the inhabitants of those areas and the local authorities to regulate those non-legalized neighborhoods, as this is a condition for receiving state benefits and regulations.


Venue analysis of legalized neighborhoods in Bogota using the Foursquare API

Now a venue analysis of the legalized neighborhoods is carried out. As a first step, we need to establish a connection with the Foursquare API to get the venues near a specific location, in this case near the coordinates of each neighborhood.
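The sketch below shows one way such a request could look. It assumes the legacy Foursquare v2 "explore" endpoint and its usual response layout, which may have changed since; the credentials, the version date, and the 500 m default radius are placeholders, and the helper names are hypothetical. Only the parser is exercised here, with a hand-written payload, so the example runs without network access.

```python
import json
import urllib.parse
import urllib.request

# Assumed legacy endpoint; credentials and version date are placeholders.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"
CLIENT_ID = "YOUR_CLIENT_ID"
CLIENT_SECRET = "YOUR_CLIENT_SECRET"
API_VERSION = "20180605"  # assumed version date (YYYYMMDD)


def build_explore_url(lat, lng, radius=500, limit=100):
    """Build the request URL for venues around one coordinate pair."""
    params = {
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "v": API_VERSION,
        "ll": f"{lat},{lng}",
        "radius": radius,  # metres; hypothetical default
        "limit": limit,
    }
    return EXPLORE_URL + "?" + urllib.parse.urlencode(params)


def parse_explore_response(payload, neighborhood):
    """Flatten the payload into (neighborhood, venue name, category) rows."""
    rows = []
    for group in payload.get("response", {}).get("groups", []):
        for item in group.get("items", []):
            venue = item["venue"]
            categories = venue.get("categories") or [{}]
            rows.append((neighborhood, venue["name"], categories[0].get("name")))
    return rows


def fetch_nearby_venues(neighborhood, lat, lng):
    """Call the API for one neighborhood (needs valid credentials and network)."""
    with urllib.request.urlopen(build_explore_url(lat, lng)) as response:
        return parse_explore_response(json.load(response), neighborhood)


# Offline check of the parser with a hand-written payload of the assumed shape.
sample_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Parque X", "categories": [{"name": "Park"}]}},
]}]}}
sample_rows = parse_explore_response(sample_payload, "BARRIO A")
print(sample_rows)
```

Calling `fetch_nearby_venues` once per neighborhood and concatenating the rows would give one long table of venues, ready for the encoding step.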

As we can see, the most common venues in Bogota according to Foursquare are restaurants, followed by parks, which is consistent with the strong interest of past and present local governments in improving this aspect of the city. Next come burger joints and food venues in general, as well as malls, gyms, and hotels.

It is worth mentioning that each request to the API is limited to 100 venues per neighborhood, as well as to a given radius around the central coordinate. Not all venues are returned, either because of these limits or because not all venues in Bogota are registered in Foursquare.

Now it is time to encode the results using dummy variables (0 or 1) that denote whether a venue category is present or not. For that we use one-hot encoding.
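A minimal sketch of the encoding with `pd.get_dummies`, using a toy venue table (the column names "Neighborhood" and "Venue Category" are assumptions for this example):

```python
import pandas as pd

# Toy venue table: one row per venue returned for a neighborhood.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B"],
    "Venue Category": ["Restaurant", "Park", "Restaurant", "Gym"],
})

# One indicator column per category: 1 where the venue belongs to it, 0 elsewhere.
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)

# Put the neighborhood name back as the first column.
onehot.insert(0, "Neighborhood", venues["Neighborhood"])
print(onehot)
```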

Now, if a venue category is present it is represented as 1; otherwise it is encoded as 0.
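Averaging these indicators per neighborhood turns them into frequencies. The sketch below assumes a small one-hot table with hypothetical categories:

```python
import pandas as pd

# Toy one-hot table: one row per venue, one indicator column per category.
onehot = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B"],
    "Gym": [0, 0, 0, 1],
    "Park": [0, 1, 0, 0],
    "Restaurant": [1, 0, 1, 0],
})

# The mean of a 0/1 column is the share of venues in that category.
grouped = onehot.groupby("Neighborhood").mean().reset_index()
print(grouped)
```

Each row of `grouped` sums to 1, so neighborhoods with very different venue counts remain comparable.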

The above results represent the mean occurrence of each venue category, grouped by neighborhood, which will be useful both to determine the most frequent venues in each neighborhood and to cluster the neighborhoods using the k-means algorithm.

Now it is possible to determine the top 10 most common venues for each neighborhood. This information allows us to classify each neighborhood according to its most and least common venues. For example, for entry 1 the most common venues are BBQ joints and the least common are fish and chips shops, so it is possible to say that this neighborhood is mostly an eating zone. In the same way we can classify all the neighborhoods.

This is valuable information, as it shows which kind of venue is the most frequent and could be used by a stakeholder to carry out a market study.
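The ranking described above could be sketched as follows. The frequencies are toy values, the helper `most_common_venues` is a hypothetical name, and the example keeps only the top 3 so the output stays readable.

```python
import pandas as pd

# Toy per-neighborhood frequencies (illustrative values).
grouped = pd.DataFrame({
    "Neighborhood": ["A", "B"],
    "BBQ Joint": [0.6, 0.1],
    "Park": [0.3, 0.2],
    "Gym": [0.1, 0.7],
})


def most_common_venues(row, top_n):
    """Return the category names of one row, from most to least frequent."""
    frequencies = row.drop("Neighborhood").astype(float)
    return frequencies.sort_values(ascending=False).index[:top_n].tolist()


TOP_N = 3  # the text uses the top 10; 3 is enough for three toy categories
ranked = pd.DataFrame(
    [
        [row["Neighborhood"]] + most_common_venues(row, TOP_N)
        for _, row in grouped.iterrows()
    ],
    columns=["Neighborhood"]
    + [f"Most common venue #{i}" for i in range(1, TOP_N + 1)],
)
print(ranked)
```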

Classifying the neighborhoods using the K-means algorithm

At this point it is possible to use Machine Learning (ML) algorithms to group the neighborhoods into clusters according to the kinds of venues present.

Let's start by determining the best value of the parameter k for the algorithm.

According to the elbow method, a good selection for k is k = 7. As k increases, the sum of squared distances tends to zero: if we set k to its maximum value n (where n is the number of samples), each sample forms its own cluster, so the sum of squared distances equals zero. The elbow is the point beyond which adding more clusters yields only a small decrease in that sum.
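A minimal sketch of the elbow method with scikit-learn is given below. It uses synthetic points in place of the neighborhood-by-category frequency matrix, so the elbow it shows (around k = 3) belongs to the toy data, not to the Bogota data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the frequency matrix:
# three well-separated groups of 20 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((20, 2)) for c in centers])

# Sum of squared distances to the closest centroid (inertia) for each k.
k_values = list(range(1, 9))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 2))
```

Plotting `inertias` against `k_values` makes the bend visible; the chosen k is where the curve flattens.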

Now let's see how the clusters look on the city map.

The first thing we can notice is the existence of a super cluster, cluster 0 (red), spread across the whole city. Its most frequent venues are a variety of food venues and hotels, which seems logical: there is no specific area in the city for those places, and we can find them all across the city.

Next, in cluster 1 (purple), we find mostly Construction & Landscaping venues, mostly in peripheral areas of the city.

Next, in cluster 2 (dark blue), we find mostly Shopping Mall venues, with almost no presence of Fabric Shops.

Next, in cluster 3 (light blue), we find mostly Grocery Store venues, with almost no presence of restaurants.

And so on.
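Per-cluster descriptions like the ones above could be derived by fitting k-means on the frequency table and then averaging the frequencies inside each cluster. The sketch below assumes toy data with three hypothetical categories and uses k = 2, since four rows cannot support the k = 7 chosen for the real data.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy per-neighborhood frequencies (illustrative values).
grouped = pd.DataFrame({
    "Neighborhood": ["A", "B", "C", "D"],
    "Restaurant": [0.8, 0.7, 0.1, 0.0],
    "Shopping Mall": [0.1, 0.2, 0.1, 0.2],
    "Grocery Store": [0.1, 0.1, 0.8, 0.8],
})
features = grouped.drop(columns="Neighborhood")

# Fit k-means and attach the cluster label of every neighborhood.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
grouped["Cluster"] = kmeans.labels_

# Mean frequency of every category inside each cluster, and the dominant one.
cluster_profile = grouped.groupby("Cluster")[features.columns.tolist()].mean()
dominant = cluster_profile.idxmax(axis=1)
print(dominant)
```

The numeric cluster labels are arbitrary, so the mapping from label to color or description has to be read from the fitted result each time.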

Detailed clustering of neighborhoods by registered Foursquare venues.